laplace kernel
On the Similarity between the Laplace and Neural Tangent Kernels
Recent theoretical work has shown that massively overparameterized neural networks are equivalent to kernel regressors that use Neural Tangent Kernels (NTKs). Experiments show that these kernel methods perform similarly to real neural networks. Here we show that NTK for fully connected networks with ReLU activation is closely related to the standard Laplace kernel. We show theoretically that for normalized data on the hypersphere both kernels have the same eigenfunctions and their eigenvalues decay polynomially at the same rate, implying that their Reproducing Kernel Hilbert Spaces (RKHS) include the same sets of functions. This means that both kernels give rise to classes of functions with the same smoothness properties. The two kernels differ for data off the hypersphere, but experiments indicate that when data is properly normalized these differences are not significant. Finally, we provide experiments on real data comparing NTK and the Laplace kernel, along with a larger class of $\gamma$-exponential kernels. We show that these perform almost identically. Our results suggest that much insight about neural networks can be obtained from analysis of the well-known Laplace kernel, which has a simple closed form.
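As a hedged numerical companion to this abstract (not code from the paper), the sketch below evaluates the widely used closed form of the two-layer ReLU NTK and the Laplace kernel as functions of the inner product $u = \langle x, z\rangle$ on the unit sphere. The Laplace width $c$ is an arbitrary choice, and pointwise closeness is only an informal proxy for the spectral statement in the abstract.

```python
import numpy as np

def ntk_relu_2layer(u):
    """Two-layer ReLU NTK on the unit sphere as a function of u = <x, z>,
    using the standard arc-cosine-kernel closed form (up to an overall scale)."""
    u = np.clip(u, -1.0, 1.0)
    kappa0 = (np.pi - np.arccos(u)) / np.pi
    kappa1 = (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u ** 2)) / np.pi
    return u * kappa0 + kappa1

def laplace_on_sphere(u, c=1.0):
    """Laplace kernel exp(-c * ||x - z||) restricted to the unit sphere,
    where ||x - z|| = sqrt(2 - 2u)."""
    u = np.clip(u, -1.0, 1.0)
    return np.exp(-c * np.sqrt(2.0 - 2.0 * u))

# Compare both kernels as functions of the angle, each normalized to 1 at u = 1.
u = np.linspace(-1.0, 1.0, 201)
ntk = ntk_relu_2layer(u) / ntk_relu_2layer(1.0)
lap = laplace_on_sphere(u, c=1.0)   # c is a free width parameter, chosen arbitrarily here
print("max pointwise gap |NTK - Laplace| on [-1, 1]:", float(np.abs(ntk - lap).max()))
```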
Predicting kernel regression learning curves from only raw data statistics
Karkada, Dhruva, Turnbull, Joseph, Liu, Yuxi, Simon, James B.
We study kernel regression with common rotation-invariant kernels on real datasets including CIFAR-5m, SVHN, and ImageNet. We give a theoretical framework that predicts learning curves (test risk vs. sample size) from only two measurements: the empirical data covariance matrix and an empirical polynomial decomposition of the target function $f_*$. The key new idea is an analytical approximation of a kernel's eigenvalues and eigenfunctions with respect to an anisotropic data distribution. The eigenfunctions resemble Hermite polynomials of the data, so we call this approximation the Hermite eigenstructure ansatz (HEA). We prove the HEA for Gaussian data, but we find that real image data is often "Gaussian enough" for the HEA to hold well in practice, enabling us to predict learning curves by applying prior results relating kernel eigenstructure to test risk. Extending beyond kernel regression, we empirically find that MLPs in the feature-learning regime learn Hermite polynomials in the order predicted by the HEA. Our HEA framework is a proof of concept that an end-to-end theory of learning which maps dataset structure all the way to model performance is possible for nontrivial learning algorithms on real datasets.
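To make the two required measurements concrete, here is a minimal hypothetical sketch (not the authors' code): it draws anisotropic Gaussian data standing in for images, computes the empirical covariance, and estimates the Hermite coefficients of a scalar target $f_*$ along one assumed direction via Monte Carlo. The spectrum, direction, and target are all illustrative choices.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He   # probabilists' Hermite polynomials He_k

rng = np.random.default_rng(0)

# Hypothetical anisotropic Gaussian "data" with a decaying covariance spectrum.
d, n = 32, 20000
spectrum = 1.0 / np.arange(1, d + 1) ** 2
X = rng.standard_normal((n, d)) * np.sqrt(spectrum)

# Measurement 1: the empirical data covariance matrix.
cov = X.T @ X / n

# Measurement 2: an empirical Hermite decomposition of a scalar target f_*.
w = np.zeros(d); w[0] = 1.0                 # hypothetical direction the target depends on
z = (X @ w) / np.sqrt(w @ cov @ w)          # approximately a standard normal coordinate
f_star = np.tanh(2.0 * z)                   # hypothetical target function
coeffs = []
for k in range(6):
    basis = np.zeros(k + 1); basis[k] = 1.0
    He_k = He.hermeval(z, basis)            # evaluates He_k(z)
    coeffs.append(np.mean(f_star * He_k) / math.factorial(k))   # E[He_k(Z)^2] = k!
print("empirical Hermite coefficients of f_*:", np.round(coeffs, 3))
```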
$L_2$-Regularized Empirical Risk Minimization Guarantees Small Smooth Calibration Error
Fujisawa, Masahiro, Futami, Futoshi
Calibration of predicted probabilities is critical for reliable machine learning, yet it is poorly understood how standard training procedures yield well-calibrated models. This work provides the first theoretical proof that canonical $L_{2}$-regularized empirical risk minimization directly controls the smooth calibration error (smCE) without post-hoc correction or specialized calibration-promoting regularizers. We establish finite-sample generalization bounds for smCE based on the optimization error, the regularization strength, and the Rademacher complexity. We then instantiate this theory for models in reproducing kernel Hilbert spaces, deriving concrete guarantees for kernel ridge and logistic regression. Our experiments confirm these guarantees, demonstrating that $L_{2}$-regularized ERM can provide a well-calibrated model without boosting or post-hoc recalibration. The source code to reproduce all experiments is available at https://github.com/msfuji0211/erm_calibration.
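A minimal sketch of the kernel ridge instantiation follows, on synthetic binary data. It solves the $L_{2}$-regularized ERM problem in an RBF RKHS and then reports a simple kernel-smoothed calibration diagnostic; this diagnostic is a proxy of my own, not the paper's smCE estimator, and the data, kernel width, and regularization strength are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(A, B, gamma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Synthetic binary data: P(y=1|x) depends smoothly on the first coordinate.
n, d, lam = 1000, 5, 1e-2
X = rng.standard_normal((n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(float)

# L2-regularized ERM in an RKHS with squared loss (kernel ridge regression):
# minimize (1/n) * sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2.
K = rbf(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

# Held-out predictions, clipped into [0, 1] to read them as probabilities.
Xt = rng.standard_normal((400, d))
yt = (rng.random(400) < 1.0 / (1.0 + np.exp(-Xt[:, 0]))).astype(float)
ft = np.clip(rbf(Xt, X) @ alpha, 0.0, 1.0)

# A simple smoothed-calibration diagnostic (a proxy, not the paper's smCE estimator):
# kernel-smooth the residual y - f as a function of the predicted probability f.
def smooth_cal_proxy(f, y, bw=0.05):
    W = np.exp(-((f[:, None] - f[None, :]) ** 2) / (2 * bw ** 2))
    resid_hat = (W @ (y - f)) / W.sum(axis=1)
    return float(np.mean(np.abs(resid_hat)))

print("smoothed calibration proxy:", round(smooth_cal_proxy(ft, yt), 4))
```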
the paper be accepted
Regarding our proof techniques: the proof of Thm. 1, which concerns the NTK of two-layer networks with bias, borrows techniques from [6]. Our proof technique for deep networks uses the algebra of RKHSs and is therefore novel in this context. Thm. 2 derives bounds that follow from the relation between the Fourier expansions of the Laplace kernel and the NTK (established in Thm. 4) and from identifying the spaces fixed under the appropriate integral transform. Regarding "why they need additional parameters a, b, c": we note that, analogously, the NTK becomes sharper for deeper networks.
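For readers connecting this rebuttal to the abstract above, the underlying Mercer picture on the hypersphere can be written schematically as follows (the precise constants and the common decay exponent $\nu$ are what the paper's theorems pin down):
$$K(x, z) = \sum_{k \ge 0} \lambda_k \sum_{m=1}^{N(d,k)} Y_{k,m}(x)\, Y_{k,m}(z), \qquad x, z \in \mathbb{S}^{d-1},$$
where the $Y_{k,m}$ are spherical harmonics, and the claim is that $\lambda_k^{\mathrm{Laplace}} \asymp \lambda_k^{\mathrm{NTK}} \asymp k^{-\nu}$ for the same exponent $\nu$, which is what makes the two RKHSs coincide as sets of functions.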
Convergence of Mean Shift Algorithms for Large Bandwidths and Simultaneous Accurate Clustering
Pal, Susovan, Vepakomma, Praneeth
The mean shift (MS) algorithm is a non-parametric, density-based, iterative algorithm with prominent applications in clustering and image segmentation. A rigorous proof of its convergence in full generality remains unknown. Two significant steps in this direction were taken in \cite{Gh1}, which proved that for a \textit{sufficiently large bandwidth} the MS algorithm with the Gaussian kernel always converges in any dimension, and in \cite{Gh2}, where the same author proved that MS always converges in one dimension for kernels with differentiable, strictly decreasing, convex profiles. More recently, \cite{YT} proved convergence in greater generality, \textit{without any restriction on the bandwidth}, under the assumptions that the KDE $f$ has a Lipschitz-continuous gradient on the closure of the convex hull of the trajectory of the iterated mode estimates and satisfies the Łojasiewicz property there. The main theoretical result of this paper generalizes \cite{Gh1}: we show that for a \textit{sufficiently large bandwidth}, convergence is guaranteed in any dimension for \textit{any radially symmetric and strictly positive definite kernel}. The proof uses two alternative characterizations of radially symmetric, positive definite, smooth kernels due to Schoenberg and Bernstein \cite{Fass}, and borrows some steps from the proofs in \cite{Gh1}. Although this result is more restrictive than that of \cite{YT} because of the lower bound on the bandwidth, it relies on a different set of assumptions, and the proof technique is different.
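The iteration being analyzed is short to state; below is a minimal sketch of the standard MS fixed-point update with a Gaussian kernel on synthetic two-blob data (an illustration of the algorithm, not of the paper's general positive definite kernel setting; the bandwidth is an arbitrary choice).

```python
import numpy as np

def mean_shift(X, bandwidth, n_iter=300, tol=1e-8):
    """Mean shift with a Gaussian kernel: each mode estimate is repeatedly replaced
    by the kernel-weighted average of the data points, a fixed-point iteration on
    stationary points of the KDE."""
    modes = X.copy()
    for _ in range(n_iter):
        d2 = ((modes[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2.0 * bandwidth ** 2))
        new_modes = (W @ X) / W.sum(axis=1, keepdims=True)
        done = np.abs(new_modes - modes).max() < tol
        modes = new_modes
        if done:
            break
    return modes

# Two well-separated blobs; with this bandwidth the iterates collapse onto the two modes.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-3, 0.3, (100, 2)), rng.normal(3, 0.3, (100, 2))])
modes = mean_shift(X, bandwidth=1.0)
print("distinct modes (rounded):", np.unique(np.round(modes, 1), axis=0))
```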
Querying Kernel Methods Suffices for Reconstructing their Training Data
Barzilai, Daniel, Margalit, Yuval, Gronich, Eitan, Yehudai, Gilad, Galun, Meirav, Basri, Ronen
Over-parameterized models have raised concerns about their potential to memorize training data, even when achieving strong generalization. The privacy implications of such memorization are generally unclear, particularly in scenarios where only model outputs are accessible. We study this question in the context of kernel methods, and demonstrate both empirically and theoretically that querying kernel models at various points suffices to reconstruct their training data, even without access to model parameters. Our results hold for a range of kernel methods, including kernel regression, support vector machines, and kernel density estimation. Our hope is that this work can illuminate potential privacy concerns for such models.
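As one hedged illustration of the query-only setting (for the kernel density estimation case; this is not the authors' attack), the sketch below fits a Gaussian KDE on a small hypothetical training set and then reconstructs the points from black-box density queries alone: a coarse query grid locates high-density regions, and finite-difference ascent on the queried log-density refines each candidate. The bandwidth, grid size, and threshold are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# "Private" training set; the querier only gets black-box access to the fitted KDE.
X_train = rng.uniform(-1, 1, size=(8, 2))
h = 0.05  # KDE bandwidth

def query(x):
    """Black-box model access: the (unnormalized) Gaussian KDE value at x, nothing else."""
    d2 = ((X_train - x) ** 2).sum(axis=1)
    return float(np.exp(-d2 / (2 * h ** 2)).sum())

# Step 1: query on a coarse grid and keep the high-density locations.
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 61), np.linspace(-1, 1, 61)), -1).reshape(-1, 2)
vals = np.array([query(g) for g in grid])
starts = grid[vals > 0.5]

# Step 2: refine each candidate by finite-difference ascent on the queried log-density.
def refine(x, eps=1e-4, steps=50):
    for _ in range(steps):
        g = np.array([(query(x + eps * e) - query(x - eps * e)) / (2 * eps)
                      for e in np.eye(2)])
        x = x + h ** 2 * g / max(query(x), 1e-12)   # log-density gradient step
    return x

# With a small bandwidth, each KDE maximum sits essentially on a training point,
# so queries alone reveal the data.
recovered = np.array([refine(s) for s in starts])
dists = np.linalg.norm(recovered[:, None, :] - X_train[None, :, :], axis=-1).min(axis=1)
print("candidates recovered:", len(recovered),
      " max distance to a true training point:", float(dists.max()))
```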